ABSTRACT

When criterion-referenced tests are used to assign
examinees to states reflecting their performance level on a test, the
better known methods for determining test length, which consider
relationships among domain scores and errors of measurement, have
their limitations. The purpose of this paper is to present a computer
system named TESTLEN, which allows test developers to determine
optimal criterion-referenced test lengths via simulation according to
user-specified conditions. Such conditions may include ability
distribution, item statistics, cut-off score, advancement score, test
model, number of examinees, and number of replications. Output
includes item statistics and values of decision consistency, kappa,
and decision accuracy for each replication. The mean, range, and
standard deviation of decision consistency, kappa, and decision
accuracy are reported across replications. Also reported is the
proportion of the examinee group assigned to each mastery
classification. A school district developing a test as a diagnostic
examination is used as an example of TESTLEN's practical application.
Recommendations and directions for the use of TESTLEN appear in the
appendix. (Author/AEP)
A Method for Determining the Length of Criterion-Referenced
Tests Using Reliability and Validity Indices^1,2,3

Craig N. Mills and Robert Simon^4

University of Massachusetts, Amherst

Criterion-referenced tests are used to determine an examinee's
status with respect to some well-defined domain of behavior (Hambleton
& Eignor, 1979; Popham, 1978). Construction of a criterion-referenced
test (CRT) usually involves (among other things) drawing a representative
sample of items from a pool of items which measures the domain of
content of interest. Of central importance in the test development
process is the determination of the number of items to be included.
The length of the test (or subtests if several objectives are measured
in a test) is directly related to the usefulness of the scores. In
general, short tests lead to less reliable and valid scores than longer
tests. Longer tests, however, while generally resulting in more
precise estimates of ability, require more testing time and may cause
examinee fatigue if they become very long. Also, since it is often the
case that several objectives are assessed in a single CRT, practical
considerations argue against a large number of items per objective.
It is important, therefore, that criterion-referenced tests contain
enough items to yield scores with desired levels of reliability
and validity without requiring excessive amounts of testing time.

^1 Paper presented at the annual meeting of the American
Educational Research Association, Los Angeles, CA, April, 1981.

^2 This project was performed pursuant to a contract from the
U.S. Air Force Human Resources Laboratory. However, the opinions do
not necessarily reflect their position or policy, and no official
endorsement by the Air Force should be inferred.

^3 Laboratory of Psychometric and Evaluative Research Report No. 110.
Amherst, MA: School of Education, University of Massachusetts, 1980.

^4 This project was performed under the direction of Dr. Ronald K.
Hambleton. The authors wish to thank him for his direction, criticism,
and advice throughout the project and the preparation of this paper.

In many instances, the purpose of a CRT is to provide an
estimate of an examinee's domain score with respect to an objective
(or competency) of interest. In such a case, when the purpose is
to estimate a domain score, the relationship among domain scores,
errors of measurement, and test length can be used to determine an
optimum test length (Lord & Novick, 1968).

The primary use of CRTs is, however, to assign examinees to
categories or states reflecting levels of performance in relation to
the objectives measured in a test. When mastery decisions are being
made it is possible to determine test length in relation to the
number of misclassification errors which can be tolerated. The purpose
of this paper is to describe a system, implemented with the aid of a
computer, which can be used to determine test lengths which will lead
to specified levels of classification errors. First, several procedures
for determining test lengths will be reviewed. After the brief review,
the computer-assisted system for determining test length will be
presented.

Methods for Determining Test Length 
Millman (1972, 1973) considered the relationship between test
length, advancement scores, and the probability of misclassification
of an individual with a known domain score by using the simple binomial
test model. The assumptions of the model are well-known and can be
found elsewhere (Millman, 1972, 1973; Hambleton & Eignor, 1979; Lord
& Novick, 1968). Millman's tables provide the probability of incorrect
classification of individuals with known domain scores for several
test lengths, advancement scores, and cut-off scores on the domain score
scale.
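The kind of entry such tables contain can be sketched with the simple binomial model. The following is a minimal illustration, not Millman's own procedure: the function names are hypothetical, and it assumes that an examinee counts as a true master when the domain score is at or above the cut-off.

```python
from math import comb


def prob_at_least(n_items, advancement_score, domain_score):
    """P(X >= advancement_score) for X ~ Binomial(n_items, domain_score)."""
    return sum(
        comb(n_items, k) * domain_score**k * (1 - domain_score) ** (n_items - k)
        for k in range(advancement_score, n_items + 1)
    )


def misclassification_probability(n_items, advancement_score, domain_score, cutoff):
    """Probability that an examinee with a known domain score is misclassified."""
    p_pass = prob_at_least(n_items, advancement_score, domain_score)
    is_true_master = domain_score >= cutoff  # assumed definition of mastery
    # A true master is misclassified by failing; a true non-master by passing.
    return 1 - p_pass if is_true_master else p_pass


# Illustrative values only: domain score 0.90, cut-off 0.80.
for n_items, advancement in [(5, 4), (10, 8), (20, 16)]:
    p = misclassification_probability(n_items, advancement, 0.90, 0.80)
    print(n_items, advancement, round(p, 3))
```

Each printed row corresponds to one combination of test length and advancement score for a single known domain score, which is the situation the tables address.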

Wilcox (1976) related the work of Fhaner (1974) to Millman's
(1973) work. An indifference zone is used in the Fhaner-Wilcox
method for determining test length. An indifference zone is that
distance around the cut-off score in which it is assumed that
relatively little harm is done when examinees with domain scores on that
interval are misclassified. Certainly in most instructional situations
such misclassifications result in only short-term assignment to
instructional sequences. Masters who are close to the cut-off score
who are misclassified as non-masters may benefit from a short
remediation sequence. Non-masters who are incorrectly classified as masters
will, in all likelihood, be quickly identified. The more serious errors
are those which misclassify individuals who are farther from the cut-off
score. The binomial model is utilized in the work of Fhaner and Wilcox
to determine test lengths which reduce, to specified levels, the
likelihood of misclassification of individuals at the ends of the
indifference zone.
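The idea of choosing a test length from the two ends of an indifference zone can be sketched as a brute-force search. This is an illustration in the spirit of the approach described above, not the published Fhaner-Wilcox procedure; the function names, the single shared error tolerance, and the search limit are assumptions.

```python
from math import comb


def binom_sf(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def shortest_test(lower, upper, max_error, max_items=100):
    """Smallest (n_items, advancement_score) meeting the error tolerance.

    lower, upper: ends of the indifference zone on the domain score scale.
    Returns None when no test of max_items or fewer items qualifies.
    """
    for n in range(1, max_items + 1):
        for c in range(1, n + 1):
            false_positive = binom_sf(n, c, lower)      # examinee at lower end passes
            false_negative = 1 - binom_sf(n, c, upper)  # examinee at upper end fails
            if false_positive <= max_error and false_negative <= max_error:
                return n, c
    return None


# Illustrative zone of 0.70 to 0.90 around a cut-off of 0.80.
print(shortest_test(0.70, 0.90, 0.10))
```

Only the two zone endpoints enter the search, which is why the result says nothing about a group of examinees spread over the whole domain score scale.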

Two problems limit the usefulness of the systems described above.
First, a very good prior estimate of an examinee's domain score is
required with the Millman method. Since the purpose of the test is
to estimate the domain score, such a prior estimate will, in all
likelihood, not be available or, if it is available, it may be
imprecise. Second, Millman's work determines optimal test length
for examinees at a single point on the domain score scale. (The
system designed by Wilcox [1976] considers only examinees at two
points on the domain scale.) That is, the Millman and Fhaner-Wilcox
methods determine test length for specific individuals; the two methods do
not consider the case when a group of examinees is of interest. To
the extent that a group of examinees with varying domain scores is
to be tested, the two systems described above will result in test
lengths which are not optimal for the group. Usually, the resultant
tests will be longer than necessary to achieve adequate levels of
decision accuracy.

Many test developers want to determine test lengths which will
achieve desired levels of reliability and validity for a group of
examinees. What is needed in this case is a system which incorporates
group information into the decision regarding test length. Eignor and
Hambleton (1979) and Eignor (1979) utilized group information when
investigating the relationship between test length and several
criterion-referenced measures of reliability and validity. Using the
simple binomial model and the compound binomial model (often considered
a more plausible model than the simple binomial model for explaining
examinee performance), Eignor and Hambleton (1979) produced graphs of
several reliability and validity indices for tests of various lengths
for five substantially different domain score distributions (cut-off =
.80). Several distributions were needed because measures of decision
consistency and decision accuracy are dependent upon the location of
the distribution of ability in relation to the cut-off score. Eignor
and Hambleton clearly demonstrated that criterion-referenced measures
of reliability and validity decrease as the distribution of domain
scores moves toward (or centers over) the cut-off score. Eignor (1979)
considered additional test lengths, domain score distributions, and
advancement scores. The tables and graphs presented in the studies
cited above should provide useful guidelines for practitioners who
are concerned with determining optimal test lengths. At least three
limitations exist, however, in the Eignor-Hambleton solution to the
test length determination problem. First, if a test developer feels that
the group of examinees of interest has a somewhat different distribution
of domain scores than those considered in the two studies, the
Eignor-Hambleton graphs will be of limited value. Similarly, the value of
the information provided by the two studies is reduced considerably
if the test lengths and advancement scores of interest differ from
those reported. Third, if the item pool to be used in
the test is not similar to one of the item pools used in the
Eignor-Hambleton solution, the results will be limited in value. In summary,
these tables are not sufficiently flexible to satisfy the requirements
of many testing situations. A better system would be one in which test
developers could more closely simulate local conditions by controlling
the distribution of domain scores and the range of item statistics and
then consider the consequences of various test lengths and advancement
scores on the statistics of interest. In the next section of the paper
such a system is described.



TESTLEN: A System for Determining Test Length

One method by which optimal test lengths can be determined is to
simulate local conditions on a computer. The FORTRAN program TESTLEN
is designed to allow users to specify local conditions and
to simulate test performances. By simulating several possible test
lengths and cut-off scores users can obtain estimates of various
statistics of interest. The values obtained may then be used to make
decisions regarding optimal test lengths. As a result, requirements
for test development or item selection are clarified.

TESTLEN will simulate parallel administrations of several
criterion-referenced tests. Characteristics of the tests (test
length, cut-off, distribution of item parameters) and characteristics
of the examinee pool (number of examinees, distribution of domain
scores) are under user control. Also under user control is the number
of replications of each parallel form administration to be simulated.
Multiple replications allow users to determine the stability of the
results. A brief description of the options available in the program
is provided in this section of the paper. Detailed instructions for
using the program can be found in Appendix A.
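A single replication of a parallel-form simulation can be sketched as follows. This is a minimal Python illustration under the simple binomial model, not the FORTRAN source of TESTLEN. The function names are hypothetical, kappa is assumed to be Cohen's kappa computed from the two forms, and decision accuracy is assumed to be the agreement between the first form's classification and true mastery status (domain score at or above the cut-off).

```python
import random


def classification_stats(form1, form2, true_master):
    """Decision consistency, kappa, and decision accuracy from boolean lists."""
    n = len(form1)
    p0 = sum(a == b for a, b in zip(form1, form2)) / n  # observed agreement
    m1 = sum(form1) / n
    m2 = sum(form2) / n
    pc = m1 * m2 + (1 - m1) * (1 - m2)                  # chance agreement
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 0.0     # guard: kappa undefined
    accuracy = sum(a == t for a, t in zip(form1, true_master)) / n
    return p0, kappa, accuracy


def simulate_once(domain_scores, n_items, advancement_score, cutoff, rng):
    """Simulate two parallel binomial forms for one group of examinees."""

    def passes(pi):
        correct = sum(rng.random() < pi for _ in range(n_items))
        return correct >= advancement_score

    form1 = [passes(pi) for pi in domain_scores]
    form2 = [passes(pi) for pi in domain_scores]
    truth = [pi >= cutoff for pi in domain_scores]
    return classification_stats(form1, form2, truth)


rng = random.Random(12345)  # fixed seed so a run can be repeated
scores = [rng.uniform(0.5, 1.0) for _ in range(300)]
dc, kappa, da = simulate_once(scores, 8, 6, 0.80, rng)
print(f"consistency={dc:.2f} kappa={kappa:.2f} accuracy={da:.2f}")
```

Calling `simulate_once` repeatedly with the same settings gives the replications whose spread indicates how stable the results are.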

Using TESTLEN with Item Response Data

If users have field tested a set of items, it may be desirable to
know the effects on decision consistency and decision accuracy of forming
parallel tests by choosing specific subsets of the items. If examinee
responses have been scored (1 if correct and 0 otherwise) the data
may be used as input for the program. If data on an external criterion

^5 Source decks may be obtained by writing to the authors at the
Laboratory of Psychometric and Evaluative Research, School of Education,
University of Massachusetts, Amherst, MA 01003. In order to cover costs
of duplication of source decks, computer cards and mailing, checks in
the amount of $25.00 made payable to the University of Massachusetts
should accompany requests.
It may be that a user [illegible] the relative position of
[illegible] might be [illegible] that numerical concepts has always been
a weak area in a course, but [illegible] have always been [illegible]
and item statistics have not been [illegible]. In this case a simulation
which utilizes the binomial test model might be chosen.

If a simulation which utilizes the binomial model is to be used
the user may specify the number of examinees out of 100 thought to
be in each tenth of the domain score scale. Alternately, the user
may choose values to describe a beta distribution which approximates
the local domain score distribution. Beta distributions are defined
on the interval [0, 1] which is the scale on which domain scores are
located. Examples of distributions obtained for five different beta
distributions are located in Table 1. Other examples of beta
distributions and the statistics which describe them can be found in Novick
and Jackson (1974, 112-113).
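The two ways of describing the examinee group can be sketched in Python. This is an illustration only: the function names are hypothetical, and spreading examinees uniformly within each tenth is an assumption, since the paper does not say how TESTLEN places scores inside a tenth.

```python
import random


def sample_from_tenths(counts_per_100, n_examinees, rng):
    """Draw domain scores from ten counts, one per tenth of the [0, 1] scale."""
    if len(counts_per_100) != 10:
        raise ValueError("exactly ten counts are required, one per tenth")
    scores = []
    for _ in range(n_examinees):
        # Pick a tenth in proportion to its count, then a point inside it.
        tenth = rng.choices(range(10), weights=counts_per_100)[0]
        scores.append(rng.uniform(tenth / 10, (tenth + 1) / 10))
    return scores


def sample_from_beta(alpha, beta, n_examinees, rng):
    """Draw domain scores from a beta distribution on [0, 1]."""
    return [rng.betavariate(alpha, beta) for _ in range(n_examinees)]


rng = random.Random(7)
# Hypothetical group: everyone between 0.50 and 1.00, evenly spread.
tenths = [0, 0, 0, 0, 0, 20, 20, 20, 20, 20]
group_a = sample_from_tenths(tenths, 300, rng)
group_b = sample_from_beta(8.0, 2.0, 300, rng)  # illustrative shape values
print(round(sum(group_a) / 300, 2), round(sum(group_b) / 300, 2))
```

Either list of domain scores can then be passed to a binomial simulation of the test.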

Using TESTLEN with Description of
Both the Examinees and the Items

If users have information pertaining to item statistics as well
as an indication of the distribution of examinee domain scores the
compound binomial model may be used for the simulations. Pertinent
[Several lines illegible in the source] traditional item statistics
[illegible]. TESTLEN will transform these statistics into appropriate
latent trait parameters. In this case, the distribution of domain
scores is specified in the same way as when the binomial simulations
are performed. That is, the user may read in the number of examinees
out of 100 in each tenth of the domain score scale or a beta
distribution may be specified.

If latent trait theory is familiar to the user, latent trait
parameters for items and examinees may be used. In this case, each
parameter (difficulty, discrimination, pseudochance, ability) may
be distributed either normally with a specified mean and standard
deviation or uniformly in a specified range.



[Several lines illegible in the source] decision consistency, kappa,
and decision accuracy across replications are reported.

Figure 1 contains a portion of an output from the program. The
simulation utilized the three-parameter logistic latent trait model to
generate the responses of 100 examinees to randomly parallel forms
of a ten-item test. The 100 examinees were distributed normally on
the latent ability scale with mean 0.0 and standard deviation 1.0.
Item difficulties were specified to range from -2.00 to +2.00;
discrimination ranged from +0.40 to [upper bound illegible]; pseudochance
values ranged from +0.15 to +0.2[final digit illegible]. The cut-off
score was set at 0.00 (the center of the ability distribution) and the
advancement score was 5 items correct.
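A simulation of this kind can be sketched with the three-parameter logistic model. The sketch is an illustration, not TESTLEN's code. The scaling constant D = 1.7 is an assumption, the upper bounds used for discrimination (2.00) and pseudochance (0.25) are placeholders because those values are not legible in this copy, and the function names are hypothetical.

```python
import math
import random

D = 1.7  # scaling constant; an assumption, not stated in the paper


def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def draw_items(n_items, rng, a_range=(0.40, 2.00), b_range=(-2.00, 2.00),
               c_range=(0.15, 0.25)):
    """Draw (a, b, c) uniformly; upper a and c bounds are placeholder values."""
    return [
        (rng.uniform(*a_range), rng.uniform(*b_range), rng.uniform(*c_range))
        for _ in range(n_items)
    ]


def number_correct(thetas, items, rng):
    """Number-correct score for each examinee on one form."""
    return [
        sum(rng.random() < p_correct_3pl(t, a, b, c) for a, b, c in items)
        for t in thetas
    ]


rng = random.Random(1981)
thetas = [rng.gauss(0.0, 1.0) for _ in range(100)]
# Randomly parallel forms: each form gets its own random draw of ten items.
form1 = number_correct(thetas, draw_items(10, rng), rng)
form2 = number_correct(thetas, draw_items(10, rng), rng)
masters1 = [x >= 5 for x in form1]  # advancement score of 5 items correct
masters2 = [x >= 5 for x in form2]
consistency = sum(m1 == m2 for m1, m2 in zip(masters1, masters2)) / len(thetas)
print(f"decision consistency = {consistency:.2f}")
```

Drawing a fresh item set for each form is one reading of "randomly parallel"; forms matched on item statistics would instead reuse the same parameter values.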
Figure 1. Sample output from TESTLEN. This is the second
of two replications in which latent trait theory was used to
simulate the performance of 100 examinees on randomly parallel
forms of a 10-item test.

As can be seen in the figure, this was the second replication
of the situation. The data at the bottom of the figure provide
information regarding the mean, range, and standard deviation of the
three statistics of primary interest (decision consistency, kappa,
and decision accuracy) for the two replications.
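The summary across replications amounts to three descriptive statistics per index. A minimal sketch follows; the replication values are hypothetical, the range is taken as maximum minus minimum, and the sample (n - 1) standard deviation is an assumption because the paper does not say which form TESTLEN uses.

```python
from statistics import mean, stdev


def summarize(values):
    """Mean, range (max minus min), and sample standard deviation."""
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "sd": stdev(values),  # n - 1 denominator; an assumption
    }


# Hypothetical values from two replications, for illustration only.
replications = {
    "decision consistency": [0.70, 0.74],
    "kappa": [0.36, 0.42],
    "decision accuracy": [0.78, 0.80],
}
for name, values in replications.items():
    s = summarize(values)
    print(f"{name}: mean={s['mean']:.2f} range={s['range']:.2f} sd={s['sd']:.3f}")
```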



An Example of a Practical Application 
of Program TESTLEN 

TESTLEN can be used early in the test development process to
provide useful data for decision-making. By simulating performance
at several test lengths with cut-off and advancement scores of
interest, developers can obtain estimates of the effect of these
factors on consistency and accuracy of the test results. Estimates
of the proportion of examinees who will need remediation are also
obtained.

In order to illustrate an application of the program, suppose a
school district is developing a test which will be used
as a diagnostic examination. Results will be used to place students
into an individualized curriculum. Fifteen objectives have been
identified as indicators by the instructors of the course. All objectives
are to be tested with as many items as needed to reach consistent and
accurate classifications at least 70 percent of the time. The test
must not, however, require more than 100 minutes to administer including
distribution and collection of materials. Randomly parallel forms
are to be developed and administered to approximately 300 students
each year.

For the first objective, it is desired to classify individuals as
having achieved the objective if they have domain scores equal to or greater
than 0.80. Past experience would indicate that students entering the
course generally have domain scores greater than 0.50 and that they
are distributed uniformly between 0.50 and 1.00. Unit tests have
indicated that items range from easy to moderate in difficulty (p-values
range from about 0.50 to about 0.90) and that discrimination indices are
all around 0.40. There appears to be little or no guessing on the items.

It can be seen from the description above that although the
number of items used for each objective will vary, it is important to
use as few items as possible for each objective in order to meet the
time constraints. Table 2 shows the results obtained from TESTLEN
for 11 possible test lengths and advancement scores for the first
objective. The domain scores for the 300 examinees were distributed
uniformly between 0.50 and 1.00. Five replications of each test were
simulated. Means, ranges, and standard deviations of decision
consistency, kappa, and decision accuracy for each test length and
advancement score are included in the table. It can be seen that
8 items with an advancement score of 6 correct would be needed for
this objective if desired levels of both decision consistency and
decision accuracy are to be obtained.
Table 2

Measures of Decision Consistency, Kappa, and Decision Accuracy
Obtained from TESTLEN for 11 Test Lengths
and Advancement Scores

(Cut-off = .80, N = 300, 5 replications)

Test Characteristics     Decision Consistency     Kappa                    Decision Accuracy
Number of  Advancement                 Standard                 Standard                 Standard
Items      Score         Mean  Range  Deviation   Mean  Range  Deviation   Mean  Range  Deviation

    2          2         .58   .07    .026        .16   .14    .055        .65   .02    .007
    3          2         .72   .08    .033        .17   .23    .084        .55   .02    .009
    3          3         .59   .04    .018        .20   .08    .032        .71   .05    .019
    4          3         .68   .07    .031        .27   .12    .057        .64   .02    .012
    5          4         .66   .03    .012        .33   .08    .030        .71   .05    .023
    6          5         .67   .06    .023        .34   .11    .047        .74   .05    .020

    7          5         .71   .06    .023        .35   .13    .051        .66   .06    .022
    7          6         .67   .08    .032        .35   .16    .063        .78   .07    .028
    8          6         .72   .09    .035        .39   .21    .084        .70   .04    .014
    ?          7         .70   .05    .019        .38   .10    .042        .75   .06    .029
   10          8         .73   .06    .027        .47   .11    .050        .80   .04    .016

? Number of items illegible in the source.
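The choice of test length from Table 2 can be expressed as a simple selection rule. The sketch below uses the mean decision consistency and mean decision accuracy values from Table 2; the row whose number of items is unreadable in this copy is omitted. The function name and the rule of taking the shortest qualifying test are illustrative assumptions.

```python
# (number of items, advancement score,
#  mean decision consistency, mean decision accuracy), from Table 2.
TABLE_2 = [
    (2, 2, 0.58, 0.65),
    (3, 2, 0.72, 0.55),
    (3, 3, 0.59, 0.71),
    (4, 3, 0.68, 0.64),
    (5, 4, 0.66, 0.71),
    (6, 5, 0.67, 0.74),
    (7, 5, 0.71, 0.66),
    (7, 6, 0.67, 0.78),
    (8, 6, 0.72, 0.70),
    (10, 8, 0.73, 0.80),
]


def shortest_acceptable(rows, criterion=0.70):
    """Shortest test whose mean consistency and accuracy both meet the criterion."""
    acceptable = [r for r in rows if r[2] >= criterion and r[3] >= criterion]
    if not acceptable:
        return None
    return min(acceptable, key=lambda r: (r[0], r[1]))


print(shortest_acceptable(TABLE_2))
```

With the 70 percent criterion from the example, the rule selects 8 items with an advancement score of 6, which agrees with the conclusion stated in the text.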
Recommendations for Use

The purpose of TESTLEN is to allow test developers to determine
optimal criterion-referenced test lengths via simulation. In this
section a few general recommendations regarding use of the program
are provided.

It is not always possible to accurately specify characteristics
of examinees and item pools. In such cases test developers will
probably want to err on the side of conservatism since it may be
better to have a few extra items than to err on the short side and
have an unacceptable number of classification errors. The following
recommendations are intended to provide guidelines for producing
conservative test lengths. First, use sample sizes similar to the number
of examinees to be tested. Larger samples will yield more stable
estimates of reliability and validity, but test developers need to
know the expected range of these statistics in their situation.
Second, when in doubt about the distribution of domain scores, it is
better to center the distribution close to the cut-off score. The
closer the distribution is to the cut-off, the more classification
errors will result. Thus, more items will be required to reach
acceptable levels of decision consistency. Third, if characteristics
of the item pool are not established, specify heterogeneous pools.
This will lead to more conservative estimates of test length.

TESTLEN simulates parallel-form administrations of criterion-
referenced tests. Some options of the program allow the user to choose
between randomly or statistically parallel tests. If two tests are
to be developed by randomly selecting items from an item pool, the
user should specify randomly parallel tests. If, however, the tests
are to be matched on item statistics, the user should choose the
option for statistical parallelism. This option would also be chosen
if only one test form is to be developed. Choosing statistical
parallelism for the simulation would be akin to investigating a test-retest
situation with one form.

Most of the options included in TESTLEN rely on a random number
generator. Users will possibly have to modify the program to conform
to the random number generator at their facility. Users should also
determine the type of seed which produces best results with the random
number generator.

Finally, the test length determination problem must be solved
for each objective (or competency) on the test for which mastery
decisions will be made.



Summary 

In this paper, several methods for determining the length of
criterion-referenced tests which are used to make mastery decisions
were reviewed. For various reasons, the methods were considered less
than ideal. A method which utilizes reliability and validity data
to determine test length, implemented via the use of a computer, was
presented. The method utilizes data which are relevant to the local
situation to determine test lengths. Among the variables under user
control are test model, number of items, number of examinees, ability
distribution, cut-off score, advancement score, and number of
replications (parallel administrations) to be conducted. Options also
exist to allow the utilization of actual, rather than simulated,
response data.
APPENDIX A
Directions for Using Program TESTLEN

The purpose of this handbook is to provide step-by-step instructions
for using Program TESTLEN. It is assumed that the user has knowledge of
format statements. That is, the user understands that a variable which
is specified to have a format of F5.2 is a real number with two decimal
places. The format I5 refers to a five place integer. With the exception
of a random number generator, the program is pretty much machine independent.
Although all options have worked satisfactorily no claim is made that the
program is error free.

In order to use this handbook, the user must answer certain questions
about the simulation which is desired. Based on the answers to each
question, the user is referred to certain sections of the handbook where detailed
instructions for setting up input are provided. After the instructions
an example is provided.

Input parameter cards are to be located on Unit 6. Output is written
to Unit 1. The first card of any TESTLEN run contains only one variable.
This variable (NJOBS, Format = I5) directs the program as to how many
different simulations or data sets are to be processed. The user should go
through the directions in this handbook once for each job specified.

Directions for Assembling Input Decks

I. What type of simulation is desired?

   If currently available item response data is to be used, go to II.
   For example, data from a pilot administration might be available.
   Subsets of items can be organized into parallel tests and results
   calculated.

   If a binomial simulation is desired, go to III. The description of
   the examinee population is under user control.

   If a compound binomial simulation is desired, go to IV. The
   description of both the examinee population and the item pool
   are under user control.



II. Utilization of Item Response Data

    A. Is an external criterion measure available? An external criterion
       is a measure other than the test of interest which can be used to
       separate examinees into mastery categories. Another test or course
       grades might be used. The agreement between the classification of
       examinees on the test of interest and on the external criterion can
       be an indication of the validity of the test.

       If there is an external criterion, go to A.2.

       If there is not an external criterion, go to A.1.



A.1. The item responses should be located on Unit 11. The input
     deck should be set up as follows:

     CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

     INTYPE(I5) = 1

     NITEST(I5) = The number of items to be included in
                  each form. For example, if responses
                  are to be organized into two parallel
                  tests of 20 items, NITEST=20. NITEST
                  cannot exceed 50.

     NEX(I5)    = The number of examinees (cannot exceed 1000)

     CUT(F5.2)  = Set to 0.00. This variable is not used
                  when an external criterion is not available.

     CUTS(I5)   = The advancement score. This is the number
                  of items an examinee must answer correctly
                  to be classified as a master on the test.

     NREP(I5)   = The number of parallel administrations to
                  be included in the current job.

     N(I7)      = Seed for the random number generator.

     IPAR(I5)   = Set to 0.



     CARD 3: FMTIR

     FMTIR(15A1) = The format by which the responses are
                   to be read from Unit 11.

     Example: Suppose data is available on 20 items and an instructor
              wants to separate the items into two parallel tests
              of 10 items each. 200 examinees took the items.
              The advancement score being considered is 7. The
              items are in fields of 2 on the data tape. The deck
              would be as follows:

     Card 2:     1   10  200 0.00    7    19834513    0

     Card 3: (20I2)
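The fixed-width layout of Card 2 can be checked with a short formatting routine. This Python sketch is a convenience for preparing input and is not part of TESTLEN; the function name is hypothetical. It follows the field formats listed above (I5, I5, I5, F5.2, I5, I5, I7, I5) and reads the example card as INTYPE=1, NITEST=10, NEX=200, CUT=0.00, CUTS=7, NREP=1, N=9834513, IPAR=0.

```python
def format_card2(intype, nitest, nex, cut, cuts, nrep, seed, ipar=0):
    """Build Card 2 with fields I5, I5, I5, F5.2, I5, I5, I7, I5."""
    if nitest > 50:
        raise ValueError("NITEST cannot exceed 50")
    if nex > 1000:
        raise ValueError("NEX cannot exceed 1000")
    card = (
        f"{intype:5d}{nitest:5d}{nex:5d}{cut:5.2f}"
        f"{cuts:5d}{nrep:5d}{seed:7d}{ipar:5d}"
    )
    if len(card) != 42:
        # A value too wide for its field would shift every later column.
        raise ValueError("a value does not fit its fixed-width field")
    return card


card = format_card2(1, 10, 200, 0.00, 7, 1, 9834513)
print(repr(card))
```

Because NREP ends in column 30 and the seven-digit seed begins in column 31, the two values run together on the card, which is why the example shows them as one string of digits.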



A.2. Is the external criterion on the same file as the item
     responses or is it on a different file?

     If the external measure is on the same file, go to A.2.a.

     If the external measure is on a different file, go to A.2.b.

A.2.a. The item responses and the criterion measure should be
       on Unit 11. The criterion measure should follow the last
       response. The input deck should be set up as follows:

       CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

       INTYPE(I5) = 2

       NITEST(I5) = The number of items to be included in
                    each form. For example, if responses
                    are to be organized into two parallel
                    tests of 20 items, NITEST=20. NITEST
                    cannot exceed 50.

       NEX(I5)    = The number of examinees (cannot exceed 1000)

       CUT(F5.2)  = The cut-off score on the external
                    criterion. If, for example, the
                    external criterion were grade point
                    average, the cut-off might be 3.00.

       CUTS(I5)   = The advancement score. This is the
                    number of items an examinee must answer
                    correctly to be classified as a master
                    on the test.

       NREP(I5)   = The number of parallel administrations
                    to be included in the current job.

       N(I7)      = Seed for the random number generator.

       IPAR(I5)   = Set to 0.

       CARD 3: FMTIR

       FMTIR(15A1) = The format by which the responses and
                     the external criterion are to be read
                     from Unit 11.

       Example: Suppose data is available on 30 items and a
                district wants to separate the items into two
                parallel tests of 15 items each. 300 examinees
                took all of the items on two different occasions.
                The external criterion is previous course grades
                and the cut-off is 2.75. An advancement score
                of 10 is being considered. The item responses
                are in fields of 1 on the data tape with GPA
                following in a field of 4. The deck would be
                as follows:

       Card 2:     2   15  300 2.75   10    28763547    0

       Card 3: (30I1, F4.2)



A.2.b. The item responses should be located on Unit 11. The
       external criterion should be located on Unit 12. The
       input deck should be set up as follows:

       CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

       INTYPE(I5) = 3

       NITEST(I5) = The number of items to be included in
                    each form. For example, if responses
                    are to be organized into two parallel
                    tests of 20 items, NITEST=20. NITEST
                    cannot exceed 50.

       NEX(I5)    = The number of examinees (cannot exceed 1000)

       CUT(F5.2)  = The cut-off score on the external
                    criterion. If, for example, the
                    external criterion were grade point
                    average, the cut-off might be 3.00.

       CUTS(I5)   = The advancement score. This is the
                    number of items an examinee must
                    answer correctly to be classified as
                    a master on the test.

       NREP(I5)   = The number of parallel administrations
                    to be included in the current job.

       N(I7)      = Seed for the random number generator.

       IPAR(I5)   = Set to 0.

       CARD 3: FMTIR

       FMTIR(15A1) = The format by which the responses are
                     to be read from Unit 11.

       CARD 4: FMTEX

       FMTEX(15A1) = The format by which the external
                     criterion is to be read from Unit 12.

       Example: Suppose data is available on 16 items and a
                district wants to separate the items into two
                parallel tests of 8 items. 1000 examinees
                took the items. The advancement score being
                considered is 5. The external criterion is
                teacher ratings; 1.0=master, 0.0=non-master. The
                deck would be as below:

       Card 2:     3    8 1000 1.00    5    11234567    0

       Card 3: (16I1)

       Card 4: (F3.1)
Simulations Utilizing the Binomial Model

A. Are the percent of examinees in each tenth of the domain score scale
   to be read in or will a beta distribution be used to describe the
   population?

   If the user wants to read in the number of people in each tenth of the
   scale, go to A.1.

   If a beta distribution is to be used, go to A.2.
A.1. The deck should be set up as follows:

CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

INTYPE(I5)  = 4

NITEST(I5)  = The number of items to be included in
              each form. For example, if responses
              are to be organized into two parallel
              tests of 20 items, NITEST=20. NITEST
              cannot exceed 50.

NEX(I5)     = The number of examinees (cannot exceed 1000).

CUT(F5.2)   = The cut-off score on the domain score
              scale. The cut-off score is a number
              between 0.00 and 1.00 which represents
              the domain score at which examinees are
              considered to be masters.

CUTS(I5)    = The advancement score. This is the
              number of items an examinee must
              answer correctly to be classified as
              a master on the test.

NREP(I5)    = The number of parallel administrations
              to be included in the current job.

N(I7)       = Seed for the random number generator.

IPAR(I5)    = Set to 0.

CARD 3: AREA(I), I = 1, 10

AREA(1)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.00, 0.10].

AREA(2)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.11, 0.20].

AREA(3)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.21, 0.30].

AREA(4)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.31, 0.40].

AREA(5)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.41, 0.50].

AREA(6)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.51, 0.60].

AREA(7)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.61, 0.70].

AREA(8)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.71, 0.80].

AREA(9)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.81, 0.90].

AREA(10)(F5.0) = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.91, 1.00].

These 10 numbers must total 100.
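
The AREA card is easy to get wrong by hand, since the ten values must sum to 100 and each must sit in its own five-column field. A minimal Python sketch of a checker and formatter follows; the function name `format_area_card` and the flat sample distribution are assumptions for illustration and are not part of TESTLEN.

```python
def format_area_card(area):
    """Check the ten AREA values and lay them out as ten F5.0 fields."""
    if len(area) != 10:
        raise ValueError("exactly 10 AREA values are required")
    if any(a < 0 for a in area):
        raise ValueError("AREA values cannot be negative")
    if sum(area) != 100:
        raise ValueError("the 10 AREA values must total 100")
    # Whole numbers right-justified in five columns are read correctly
    # by an F5.0 edit descriptor.
    return "".join(f"{float(a):5.0f}" for a in area)

# Hypothetical flat distribution: 10 people in each tenth of the scale.
area_card = format_area_card([10] * 10)
print(area_card.replace(" ", "-"))
```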

Example: Suppose an instructor plans to test 500 examinees on a
         10 item test. The cut-off score is to be .75 and the
         instructor wants to investigate the effects of an
         advancement score of 7. There is a small group of
         students (about 10%) who are definitely very low
         performers. The rest seem to be fairly evenly
         distributed in the top forty percent of the domain score
         scale. Five replications of the simulation are
         desired in order to get a feeling for the range of
         possible values. The input deck might be as follows:

Card 2: ----4---10--500-0.75----7----54395183----0

Card 3: ----0----5----5----0----0----0---25---25---20---20
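
The kind of simulation this deck requests can be sketched in a few lines of Python. This is not the TESTLEN algorithm; it is a hedged illustration under stated assumptions: domain scores are taken to be uniformly spread within each tenth, each item is answered correctly with probability equal to the examinee's domain score (the binomial model), kappa is computed as chance-corrected agreement between the two forms, and decision accuracy compares the first form's decision with the true status (domain score at or above the cut-off). The AREA values are one set consistent with the example's description.

```python
import random

def simulate_once(area, n_items, n_examinees, cut, advancement, rng):
    """One replication: (decision consistency, kappa, decision accuracy)."""
    agree = masters1 = masters2 = accurate = 0
    for _ in range(n_examinees):
        tenth = rng.choices(range(10), weights=area)[0]
        # Assumption: domain scores are uniform inside the chosen tenth.
        pi = rng.uniform(tenth / 10, (tenth + 1) / 10)
        # Binomial model: every item is correct with probability pi.
        score1 = sum(rng.random() < pi for _ in range(n_items))
        score2 = sum(rng.random() < pi for _ in range(n_items))
        m1, m2 = score1 >= advancement, score2 >= advancement
        agree += m1 == m2
        masters1 += m1
        masters2 += m2
        accurate += m1 == (pi >= cut)
    p0 = agree / n_examinees
    p1, p2 = masters1 / n_examinees, masters2 / n_examinees
    pc = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else float("nan")
    return p0, kappa, accurate / n_examinees

rng = random.Random(4395183)
area = [0, 5, 5, 0, 0, 0, 25, 25, 20, 20]
results = [simulate_once(area, 10, 500, 0.75, 7, rng) for _ in range(5)]
for p0, kappa, acc in results:
    print(f"consistency={p0:.3f}  kappa={kappa:.3f}  accuracy={acc:.3f}")
```

Running several replications, as the example requests, shows how much the three indices move from one simulated administration to the next.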



A.2. The deck should be set up as follows:

CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

INTYPE(I5)  = 5

NITEST(I5)  = The number of items to be included in
              each form. For example, if responses
              are to be organized into two parallel
              tests of 20 items, NITEST=20. NITEST
              cannot exceed 50.

NEX(I5)     = The number of examinees (cannot exceed 1000).

CUT(F5.2)   = The cut-off score on the domain score scale.
              The cut-off score is a number between 0.00 and
              1.00 which represents the domain score at
              which examinees are considered to be masters.

CUTS(I5)    = The advancement score. This is the
              number of items an examinee must
              answer correctly to be classified as
              a master on the test.

NREP(I5)    = The number of parallel administrations
              to be included in the current job.

N(I7)       = Seed for the random number generator.

IPAR(I5)    = Set to 0.

CARD 3: IP, IQ

IP(I5)      = First descriptor of beta distribution.

IQ(I5)      = Second descriptor of beta distribution.

Example: Suppose a test developer wants to determine the effects
         of using a 5 item test with a cut-off of 0.80 and an
         advancement score of 4. Large numbers of examinees
         will take the test. Past experience has shown the bulk
         of the examinees to be located in the region of the
         cut-off score with a few in the region .40 to .60.
         Ten replications are to be conducted. The deck may
         be set up as follows:

Card 2: ----5----5-1000-0.80----4---109812375----0

Card 3: ----8----2
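
A quick check shows why descriptors of 8 and 2 fit the description in this example. The Python sketch below assumes that IP and IQ are the two shape parameters of a standard beta distribution, which the manual does not state explicitly; the function names are illustrative. The closed-form distribution function applies only to the Beta(8, 2) case.

```python
def beta_mean_var(p, q):
    """Mean and variance of a beta distribution with shape parameters p, q."""
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1))
    return mean, var

def beta82_cdf(x):
    """Closed-form CDF of Beta(8, 2): the integral of 72 t^7 (1 - t)."""
    return 9 * x ** 8 - 8 * x ** 9

mean, var = beta_mean_var(8, 2)
share = beta82_cdf(0.60) - beta82_cdf(0.40)   # proportion in .40 to .60
print(mean, var, share)
```

Under that assumption the mean domain score is 0.80, which equals the cut-off, and a little under 7 percent of examinees fall between .40 and .60, in line with "a few".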

IV. Simulations Utilizing the Compound Binomial Model

A. Are latent trait parameters to be used or will classical statistics
   be read in and converted to latent trait values?

   If latent trait parameters are to be read in, go to A.1.

   If classical statistics (p-values, etc.) are to be converted, go to A.2.



A.1. The user must specify the distributions which are desired for item
     difficulty, discrimination, pseudochance, and ability (b, a, c,
     and θ, respectively). Two options are available. Each variable
     may be distributed (1) normally with a specified mean and
     standard deviation or (2) uniformly across a specified range.
     The input deck should be set up as follows:

CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

INTYPE(I5)  = 6

NITEST(I5)  = The number of items to be included in
              each form. For example, if responses
              are to be organized into two parallel
              tests of 20 items, NITEST=20. NITEST
              cannot exceed 50.

NEX(I5)     = The number of examinees (cannot exceed 1000).

CUT(F5.2)   = The cut-off score on the domain score
              scale. The cut-off score is a number
              between 0.00 and 1.00 which represents
              the domain score at which examinees are
              considered to be masters. TESTLEN converts
              this value to a cut-off on the ability
              scale for the test which is generated.

CUTS(I5)    = The advancement score. This is the
              number of items an examinee must
              answer correctly to be classified as
              a master on the test.

NREP(I5)    = The number of parallel administrations
              to be included in the current job.

N(I7)       = Seed for the random number generator.

IPAR(I5)    = 0 if the two tests are to be randomly
              parallel.

            = 1 if the two tests are to be statistically
              parallel (identical item parameters).
              This option would be chosen if the item
              pool is large enough to permit building
              identical forms or if only one form is
              actually to be developed and the second
              form is used as a hypothetical test for
              simulation purposes only.



CARD 3: IB, BBOT, BTOP

IB(I5)      = 1 if a normal distribution of item difficulty
              parameters (b values) is desired.

            = 2 if a uniform distribution of item
              difficulty parameters (b values) is
              desired.

BBOT(F5.2)  = If IB=1, desired mean of item difficulties.

            = If IB=2, lower limit of range of item
              difficulties.

BTOP(F5.2)  = If IB=1, desired standard deviation of
              item difficulties.

            = If IB=2, upper limit of range of item
              difficulties.

CARD 4: IA, ABOT, ATOP

IA(I5)      = 1 if a normal distribution of item
              discrimination parameters (a values) is desired.

            = 2 if a uniform distribution of item
              discrimination parameters (a values) is desired.

ABOT(F5.2)  = If IA=1, desired mean of item discrimination
              values.

            = If IA=2, lower limit of range of item
              discrimination values.

ATOP(F5.2)  = If IA=1, desired standard deviation of
              item discrimination values.

            = If IA=2, upper limit of range of item
              discrimination values.

CARD 5: IC, CBOT, CTOP

IC(I5)      = 1 if a normal distribution of item
              pseudochance (c values) is desired.

            = 2 if a uniform distribution of item
              pseudochance (c values) is desired.

CBOT(F5.2)  = If IC=1, desired mean of item pseudochance
              values.

            = If IC=2, lower limit of range of item
              pseudochance values.

CTOP(F5.2)  = If IC=1, desired standard deviation
              of item pseudochance values.

            = If IC=2, upper limit of range of item
              pseudochance values.

CARD 6: ITHET, THBOT, THTOP

ITHET(I5)   = 1 if a normal distribution of ability
              (θ values) is desired.

            = 2 if a uniform distribution of ability
              (θ values) is desired.

THBOT(F5.2) = If ITHET=1, desired mean of the ability
              distribution.

            = If ITHET=2, lower limit of range of the
              ability distribution.

THTOP(F5.2) = If ITHET=1, desired standard deviation
              of the ability distribution.

            = If ITHET=2, upper limit of the range of
              the ability distribution.
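
The CUT description says that TESTLEN converts the domain-score cut-off to a cut-off on the ability scale, but the manual does not describe how. One plausible approach, sketched below in Python purely as an assumption, is to find the ability at which the expected proportion correct on the generated test equals CUT. The sketch uses a three-parameter logistic item response function with scaling constant 1.7 and a bisection search; the function names, the scaling constant, and the search limits are all illustrative choices.

```python
import math

def p_correct(theta, a, b, c, d=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))

def theta_cut(items, cut, lo=-6.0, hi=6.0, tol=1e-6):
    """Ability at which the expected proportion correct equals `cut`.

    `items` is a list of (a, b, c) tuples. The expected proportion
    correct rises with theta, so bisection is enough.
    """
    def expected_proportion(theta):
        return sum(p_correct(theta, a, b, c) for a, b, c in items) / len(items)

    if not expected_proportion(lo) <= cut <= expected_proportion(hi):
        raise ValueError("cut-off is not attainable on this test")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_proportion(mid) < cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical three-item test, (a, b, c) per item.
items = [(1.0, -0.5, 0.20), (1.2, 0.0, 0.20), (0.8, 0.5, 0.20)]
print(round(theta_cut(items, 0.70), 3))
```

A cut-off below the average pseudochance level cannot be reached at any ability, which is why the sketch rejects such values.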



Example: Suppose a 7 item test with a cut-off score of .70
         and an advancement score of 5 is being considered
         for use where 150 students will be
         tested on the objective. There is one form of the
         test. The range of item difficulties is -2.00 to
         2.00; item discriminations range from 0.50 to 1.75
         and guessing ranges from 0.15 to 0.25. Ability of
         students is expected to be normally distributed with
         a mean of 0.00 and a standard deviation of 1.00.
         The data would be arranged as follows:

Card 2: ----6----7--150-0.70----5 287937 1

Card 3: ----2-2.00+2.00

Card 4: ----2+0.50+1.75

Card 5: ----2+0.15+0.25

Card 6: ----1+0.00+1.00
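
The distribution codes on Cards 3 through 6 can be read as "1 = normal (mean, standard deviation), 2 = uniform (lower limit, upper limit)". The Python sketch below draws item parameters and abilities that way for the example just given and scores one simulated administration. It is an illustration only: the three-parameter logistic response function with scaling constant 1.7, the helper names, and the seed are assumptions, and TESTLEN's own generator will not reproduce these numbers.

```python
import math
import random

def draw(kind, bot, top, rng):
    """kind 1: normal with mean `bot`, s.d. `top`; kind 2: uniform on [bot, top]."""
    if kind == 1:
        return rng.gauss(bot, top)
    if kind == 2:
        return rng.uniform(bot, top)
    raise ValueError("distribution code must be 1 or 2")

def p_correct(theta, a, b, c, d=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))

rng = random.Random(287937)
n_items, n_examinees, advancement = 7, 150, 5

# Cards 3 to 5: uniform b on [-2.00, 2.00], a on [0.50, 1.75], c on [0.15, 0.25].
items = [(draw(2, 0.50, 1.75, rng),      # a
          draw(2, -2.00, 2.00, rng),     # b
          draw(2, 0.15, 0.25, rng))      # c
         for _ in range(n_items)]

# Card 6: normal ability with mean 0.00 and standard deviation 1.00.
scores = []
for _ in range(n_examinees):
    theta = draw(1, 0.00, 1.00, rng)
    scores.append(sum(rng.random() < p_correct(theta, a, b, c)
                      for a, b, c in items))

masters = sum(s >= advancement for s in scores)
print(f"proportion classified as masters: {masters / n_examinees:.3f}")
```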



A.2. Are the percent of examinees in each interval of the domain
     score scale to be read in or will a beta distribution be used
     to describe the population?

     If the number of people in each interval is to be read, go to A.2.a.

     If a beta distribution is to be used, go to A.2.b.

A.2.a. The deck should be set up as follows:

CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

INTYPE(I5)  = 7

NITEST(I5)  = The number of items to be included in
              each form. For example, if responses
              are to be organized into two parallel
              tests of 20 items, NITEST=20. NITEST
              cannot exceed 50.

NEX(I5)     = The number of examinees (cannot exceed 1000).

CUT(F5.2)   = The cut-off score on the domain score
              scale. The cut-off score is a number
              between 0.00 and 1.00 which represents
              the domain score at which examinees are
              considered to be masters. TESTLEN converts
              this value to a cut-off on the
              ability scale for the test which is
              generated.

CUTS(I5)    = The advancement score. This is the
              number of items an examinee must answer
              correctly to be classified as a master
              on the test.

NREP(I5)    = The number of parallel administrations
              to be included in the current job.

N(I7)       = Seed for the random number generator.

IPAR(I5)    = 0 if the two tests are to be randomly
              parallel.

            = 1 if the two tests are to be statistically
              parallel (identical item parameters).
              This option would be chosen if the item
              pool is large enough to permit building
              identical forms or if only one form is
              actually to be developed and the second
              form is used as a hypothetical test for
              simulation purposes only.
CARD 3: LTM, NCH, PBOT, PTOP, RBOT, RTOP

LTM(I5)     = 1 if item difficulties vary, but all
              item discrimination indices are very
              similar in value and guessing is not
              thought to be a factor in test performance.

            = 2 if item difficulties and discriminations
              vary, but guessing is not thought to be
              a factor in test performance.

            = 3 if item difficulty and discrimination
              vary and guessing is thought to affect
              test performance.

NCH(I5)     = The number of options per item.

PBOT(F5.2)  = The lower limit of the range of item
              difficulties (p values) to be included
              in the test.

PTOP(F5.2)  = The upper limit of the range of item
              difficulties (p values) to be included
              in the test.

RBOT(F5.2)  = The lower limit of the range of item
              discrimination indices (r values) to be
              included in the test.

RTOP(F5.2)  = The upper limit of the range of item
              discrimination indices (r values) to be
              included in the test (RBOT=RTOP if LTM=1).



CARD 4: ITHET, THBOT, THTOP

ITHET(I5)   = 4

THTOP(F5.2) = Set to 0.00.

THBOT(F5.2) = Set to 0.00.



CARD 5: AREA(I), I = 1, 10

AREA(1)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.00, 0.10].

AREA(2)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.11, 0.20].

AREA(3)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.21, 0.30].

AREA(4)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.31, 0.40].

AREA(5)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.41, 0.50].

AREA(6)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.51, 0.60].

AREA(7)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.61, 0.70].

AREA(8)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.71, 0.80].

AREA(9)(F5.0)  = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.81, 0.90].

AREA(10)(F5.0) = The number of people out of 100 who
                 are expected to have domain scores
                 on the interval [0.91, 1.00].

Example: Assume an instructor is considering testing an objective
         with randomly parallel tests of 8 items. Items will
         be four option multiple-choice items. The cut-off score
         is 0.75 and the advancement score is 6. A large group of
         students are performing at high levels and another group
         is performing at a moderate level. The remaining students
         are evenly distributed between the extremes. There is a
         wide range expected in both difficulty and discrimination
         indices and guessing will probably be a factor. Five
         replications are desired on a sample of 120 students. The
         data could be arranged as follows:






Card 3: ----3----4-0.30-0.80-0.23-0.65
Card 4: ----4-0.00-0.00
C«»d 3 : — -"O— — 0-— «0— — 0— -aO-^-lS S^-ii^^-S^U".— 20 
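
When classical statistics are supplied, TESTLEN converts them to latent trait values, but the manual does not give the conversion. The Python sketch below shows one textbook approximation, stated here as an assumption rather than as TESTLEN's method: under a normal-ogive model with normally distributed ability, a = r / sqrt(1 - r^2) and b = -z(p) / r, where r is treated as a biserial correlation and z(p) is the standard normal deviate below which a proportion p of the distribution lies. When guessing is allowed (LTM = 3), the sketch further assumes c = 1/NCH and corrects p for chance before converting. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def classical_to_latent(p, r, ltm, nch):
    """Approximate (a, b, c) from a classical p value and r value.

    Assumptions: r is a biserial correlation, ability is normal, and
    the pseudochance level is 1/NCH when LTM = 3 and 0 otherwise.
    """
    c = 1.0 / nch if ltm == 3 else 0.0
    p_adj = (p - c) / (1.0 - c)          # remove the share due to guessing
    if not 0.0 < p_adj < 1.0:
        raise ValueError("p value is outside the convertible range")
    if not 0.0 < r < 1.0:
        raise ValueError("r value must lie strictly between 0 and 1")
    a = r / sqrt(1.0 - r * r)
    b = -NormalDist().inv_cdf(p_adj) / r
    return a, b, c

# Hypothetical item: 65 percent correct, r = 0.45, four options, guessing on.
a, b, c = classical_to_latent(0.65, 0.45, 3, 4)
print(f"a={a:.3f}  b={b:.3f}  c={c:.3f}")
```

Easy items (high p) receive negative b values and hard items positive ones, and a grows quickly as r approaches 1.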



A.2.b. The deck should be arranged as follows:

CARD 2: INTYPE, NITEST, NEX, CUT, CUTS, NREP, N, IPAR

INTYPE(I5)  = 7

NITEST(I5)  = The number of items to be included in
              each form. For example, if responses
              are to be organized into two parallel
              tests of 20 items, NITEST=20. NITEST
              cannot exceed 50.

NEX(I5)     = The number of examinees (cannot exceed 1000).

CUT(F5.2)   = The cut-off score on the domain score
              scale. The cut-off score is a number
              between 0.00 and 1.00 which represents
              the domain score at which examinees are
              considered to be masters. TESTLEN converts
              this value to a cut-off on the
              ability scale for the test which is
              generated.

CUTS(I5)    = The advancement score. This is the
              number of items an examinee must
              answer correctly to be classified as
              a master on the test.

NREP(I5)    = The number of parallel administrations
              to be included in the current job.

N(I7)       = Seed for the random number generator.

IPAR(I5)    = 0 if the two tests are to be randomly
              parallel.

            = 1 if the two tests are to be statistically
              parallel (identical item parameters).
              This option would be chosen if the item
              pool is large enough to permit building
              identical forms or if only one form is
              actually to be developed and the second
              form is used as a hypothetical test for
              simulation purposes only.
CARD 3: LTM, NCH, PBOT, PTOP, RBOT, RTOP

LTM(I5)     = 1 if item difficulties vary, but all
              item discrimination indices are very
              similar in value and guessing is not
              thought to be a factor in test performance.

            = 2 if item difficulties and discriminations
              vary, but guessing is not thought to be
              a factor in test performance.

            = 3 if item difficulty and discrimination
              vary and guessing is thought to affect
              test performance.

NCH(I5)     = The number of options per item.

PBOT(F5.2)  = The lower limit of the range of item
              difficulties (p values) to be included
              in the test.

PTOP(F5.2)  = The upper limit of the range of item
              difficulties (p values) to be included
              in the test.

RBOT(F5.2)  = The lower limit of the range of item
              discrimination indices (r values) to be
              included in the test.

RTOP(F5.2)  = The upper limit of the range of item
              discrimination indices (r values) to be
              included in the test (RBOT=RTOP if LTM=1).

CARD 4: ITHET, THBOT, THTOP

ITHET(I5)   = Set to 3.

THBOT(F5.2) = First descriptor of beta distribution.

THTOP(F5.2) = Second descriptor of beta distribution.

Example: Suppose the results of an administration of one
         form of a 4-item multiple-choice test (5 options)
         with a cut-off score of 0.90 and an advancement
         score of 4 are of interest for a group of 50 examinees.
         Items range from moderate to easy in difficulty,
         discriminations are all around .45 and guessing is
         not thought to be a factor. The average domain score
         is probably around 0.85, but a few examinees may be
         at or below 0.50. Only one replication of the
         simulation is requested. The data might be arranged as
         below:

Card 2: ----7----4---50-0.90----4----11239867----1

Card 3: ----1----5-0.50-0.90-0.45-0.45

Card 4: ----3-6.00-1.00
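
Because NREP and the seed N occupy adjacent columns, a printed Card 2 can be hard to read back by eye. The Python sketch below splits a card image into its eight fields by column position, treating blank fields as zero. The names `CARD2_FIELDS` and `parse_card2` are illustrative and are not part of TESTLEN; the card image used is one consistent with the example above, with blanks in place of the hyphens.

```python
# (name, width, converter) for each CARD 2 field, in column order.
CARD2_FIELDS = [
    ("INTYPE", 5, int), ("NITEST", 5, int), ("NEX", 5, int),
    ("CUT", 5, float), ("CUTS", 5, int), ("NREP", 5, int),
    ("N", 7, int), ("IPAR", 5, int),
]

def parse_card2(card):
    """Split a CARD 2 image into named values by fixed column position."""
    card = card.ljust(42)            # pad short cards with blanks
    values, pos = {}, 0
    for name, width, convert in CARD2_FIELDS:
        text = card[pos:pos + width].strip() or "0"   # blank field reads as 0
        values[name] = convert(text)
        pos += width
    return values

image = "    7    4   50 0.90    4    11239867    1"
fields = parse_card2(image)
print(fields)
```

Read this way, the run of digits after the advancement score separates into one replication (NREP = 1) and a seven-digit seed.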



